InternVideo2.5 is a video multimodal large language model (MLLM) built upon InternVL2.5 and enhanced with Long and Rich Context (LRC) modeling. It significantly improves on existing MLLMs by strengthening the ability to perceive fine-grained details and capture long-term temporal structure.
Video-to-Text
Transformers English